Computing machine having improved computing architecture and related system and method

ABSTRACT

A computing machine includes a first buffer and a processor coupled to the buffer. The processor executes an application, a first data-transfer object, and a second data-transfer object, publishes data under the control of the application, loads the published data into the buffer under the control of the first data-transfer object, and retrieves the published data from the buffer under the control of the second data-transfer object. Alternatively, the processor retrieves data and loads the retrieved data into the buffer under the control of the first data-transfer object, unloads the data from the buffer under the control of the second data-transfer object, and processes the unloaded data under the control of the application. Where the computing machine is a peer-vector machine that includes a hardwired pipeline accelerator coupled to the processor, the buffer and data-transfer objects facilitate the transfer of data between the application and the accelerator.

CLAIM OF PRIORITY

[0001] This application claims priority to U.S. Provisional Application Serial No. 60/422,503, filed on Oct. 31, 2002, which is incorporated by reference.

CROSS REFERENCE TO RELATED APPLICATIONS

[0002] This application is related to U.S. patent application Ser. Nos. ______ entitled IMPROVED COMPUTING ARCHITECTURE AND RELATED SYSTEM AND METHOD (Attorney Docket No. 1934-11-3), ______ entitled PIPELINE ACCELERATOR FOR IMPROVED COMPUTING ARCHITECTURE AND RELATED SYSTEM AND METHOD (Attorney Docket No. 1934-13-3), ______ entitled PROGRAMMABLE CIRCUIT AND RELATED COMPUTING MACHINE AND METHOD (Attorney Docket No. 1934-14-3), and ______ entitled PIPELINE ACCELERATOR HAVING MULTIPLE PIPELINE UNITS AND RELATED COMPUTING MACHINE AND METHOD (Attorney Docket No. 1934-15-3), which have a common filing date and owner and which are incorporated by reference.

BACKGROUND

[0003] A common computing architecture for processing relatively large amounts of data in a relatively short period of time includes multiple interconnected processors that share the processing burden. By sharing the processing burden, these multiple processors can often process the data more quickly than a single processor can for a given clock frequency. For example, each of the processors can process a respective portion of the data or execute a respective portion of a processing algorithm.

[0004] FIG. 1 is a schematic block diagram of a conventional computing machine 10 having a multi-processor architecture. The machine 10 includes a master processor 12 and coprocessors 14₁-14ₙ, which communicate with each other and the master processor via a bus 16, an input port 18 for receiving raw data from a remote device (not shown in FIG. 1), and an output port 20 for providing processed data to the remote device. The machine 10 also includes a memory 22 for the master processor 12, respective memories 24₁-24ₙ for the coprocessors 14₁-14ₙ, and a memory 26 that the master processor and coprocessors share via the bus 16. The memory 22 serves as both a program and a working memory for the master processor 12, and each memory 24₁-24ₙ serves as both a program and a working memory for a respective coprocessor 14₁-14ₙ. The shared memory 26 allows the master processor 12 and the coprocessors 14 to transfer data among themselves, and from/to the remote device via the ports 18 and 20, respectively. The master processor 12 and the coprocessors 14 also receive a common clock signal that controls the speed at which the machine 10 processes the raw data.

[0005] In general, the computing machine 10 effectively divides the processing of raw data among the master processor 12 and the coprocessors 14. The remote device (not shown in FIG. 1), such as a sonar array, loads the raw data via the port 18 into a section of the shared memory 26, which acts as a first-in-first-out (FIFO) buffer (not shown) for the raw data. The master processor 12 retrieves the raw data from the memory 26 via the bus 16, and then the master processor and the coprocessors 14 process the raw data, transferring data among themselves as necessary via the bus 16. The master processor 12 loads the processed data into another FIFO buffer (not shown) defined in the shared memory 26, and the remote device retrieves the processed data from this FIFO via the port 20.

[0006] In an example of operation, the computing machine 10 processes the raw data by sequentially performing n+1 respective operations on the raw data, where these operations together compose a processing algorithm such as a Fast Fourier Transform (FFT). More specifically, the machine 10 forms a data-processing pipeline from the master processor 12 and the coprocessors 14. For a given frequency of the clock signal, such a pipeline often allows the machine 10 to process the raw data faster than a machine having only a single processor.

[0007] After retrieving the raw data from the raw-data FIFO (not shown) in the memory 26, the master processor 12 performs a first operation, such as a trigonometric function, on the raw data. This operation yields a first result, which the processor 12 stores in a first-result FIFO (not shown) defined within the memory 26. Typically, the processor 12 executes a program stored in the memory 22, and performs the above-described actions under the control of the program. The processor 12 may also use the memory 22 as working memory to temporarily store data that the processor generates at intermediate intervals of the first operation.

[0008] Next, after retrieving the first result from the first-result FIFO (not shown) in the memory 26, the coprocessor 14₁ performs a second operation, such as a logarithmic function, on the first result. This second operation yields a second result, which the coprocessor 14₁ stores in a second-result FIFO (not shown) defined within the memory 26. Typically, the coprocessor 14₁ executes a program stored in the memory 24₁, and performs the above-described actions under the control of the program. The coprocessor 14₁ may also use the memory 24₁ as working memory to temporarily store data that the coprocessor generates at intermediate intervals of the second operation.

[0009] Then, the coprocessors 14₂-14ₙ sequentially perform third through n-th operations on the second through (n−1)-th results in a manner similar to that discussed above for the coprocessor 14₁.

[0010] The n-th operation, which is performed by the coprocessor 14ₙ, yields the final result, i.e., the processed data. The coprocessor 14ₙ loads the processed data into a processed-data FIFO (not shown) defined within the memory 26, and the remote device (not shown in FIG. 1) retrieves the processed data from this FIFO.

[0011] Because the master processor 12 and coprocessors 14 are simultaneously performing different operations of the processing algorithm, the computing machine 10 is often able to process the raw data faster than a computing machine having a single processor that sequentially performs the different operations. Specifically, the single processor cannot retrieve a new set of the raw data until it performs all n+1 operations on the previous set of raw data. But using the pipeline technique discussed above, the master processor 12 can retrieve a new set of raw data after performing only the first operation. Consequently, for a given clock frequency, this pipeline technique can increase the speed at which the machine 10 processes the raw data by a factor of approximately n+1 as compared to a single-processor machine (not shown in FIG. 1).
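
For illustration only, the approximate n+1 speedup can be checked with a short Python sketch. The sketch assumes, as a simplification not stated above, that every operation takes one equal time step and that data transfers cost nothing; the function name pipeline_speedup and its parameters are hypothetical.

```python
def pipeline_speedup(num_operations, num_data_sets):
    """Ratio of sequential time to pipelined time, in equal-length time steps.

    Assumption: each of the num_operations operations takes exactly one step.
    """
    # A single processor performs every operation on every data set in turn.
    sequential_steps = num_operations * num_data_sets
    # A pipeline needs num_operations steps to produce its first result and
    # then completes one further data set per step.
    pipelined_steps = num_operations + (num_data_sets - 1)
    return sequential_steps / pipelined_steps


# n = 3 coprocessors plus the master processor gives n + 1 = 4 operations.
speedup_one_set = pipeline_speedup(4, 1)
speedup_many_sets = pipeline_speedup(4, 1000)
```

Under these assumptions the ratio is 1 for a single data set and approaches n+1 only as the number of data sets grows, which is consistent with the qualifier "approximately" in the paragraph above.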

[0012] Alternatively, the computing machine 10 may process the raw data in parallel by simultaneously performing n+1 instances of a processing algorithm, such as an FFT, on the raw data. That is, if the algorithm includes n+1 sequential operations as described in the previous example, then each of the master processor 12 and the coprocessors 14 sequentially performs all n+1 operations on a respective set of the raw data. Consequently, for a given clock frequency, this parallel-processing technique, like the above-described pipeline technique, can increase the speed at which the machine 10 processes the raw data by a factor of approximately n+1 as compared to a single-processor machine (not shown in FIG. 1).

[0013] Unfortunately, although the computing machine 10 can process data more quickly than a single-processor computing machine (not shown in FIG. 1), the data-processing speed of the machine 10 is often significantly less than the frequency of the processor clock. Specifically, the data-processing speed of the computing machine 10 is limited by the time that the master processor 12 and coprocessors 14 require to process data. For brevity, an example of this speed limitation is discussed in conjunction with the master processor 12, although it is understood that this discussion also applies to the coprocessors 14. As discussed above, the master processor 12 executes a program that controls the processor to manipulate data in a desired manner. This program includes a sequence of instructions that the processor 12 executes. Unfortunately, the processor 12 typically requires multiple clock cycles to execute a single instruction, and often must execute multiple instructions to process a single value of data. For example, suppose that the processor 12 is to multiply a first data value A (not shown) by a second data value B (not shown). During a first clock cycle, the processor 12 retrieves a multiply instruction from the memory 22. During second and third clock cycles, the processor 12 respectively retrieves A and B from the memory 26. During a fourth clock cycle, the processor 12 multiplies A and B, and, during a fifth clock cycle, stores the resulting product in the memory 22 or 26 or provides the resulting product to the remote device (not shown). This is a best-case scenario, because in many cases the processor 12 requires additional clock cycles for overhead tasks such as initializing and closing counters. Therefore, at best the processor 12 requires five clock cycles, or an average of 2.5 clock cycles per data value, to process A and B.

[0014] Consequently, the speed at which the computing machine 10 processes data is often significantly lower than the frequency of the clock that drives the master processor 12 and the coprocessors 14. For example, if the processor 12 is clocked at 1.0 Gigahertz (GHz) but requires an average of 2.5 clock cycles per data value, then the effective data-processing speed equals (1.0 GHz)/2.5 = 0.4 GHz. This effective data-processing speed is often characterized in units of operations per second. Therefore, in this example, for a clock speed of 1.0 GHz, the processor 12 would be rated with a data-processing speed of 0.4 Gigaoperations/second (Gops).
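
The arithmetic of this example can be restated as a brief Python sketch. The function name effective_rate_gops and the variable names are hypothetical; the figures (five clock cycles, two data values, a 1.0 GHz clock) are those of the example above.

```python
def effective_rate_gops(clock_ghz, cycles_per_value):
    """Effective data-processing speed in Gops for a given clock frequency."""
    return clock_ghz / cycles_per_value


# Best case from the multiply example: one cycle each to fetch the
# instruction, fetch A, fetch B, multiply, and store the product.
cycles_for_multiply = 5
values_processed = 2  # A and B
cycles_per_value = cycles_for_multiply / values_processed

rate = effective_rate_gops(1.0, cycles_per_value)
print(f"{cycles_per_value} cycles per value, {rate:.1f} Gops")
```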

[0015] FIG. 2 is a block diagram of a hardwired data pipeline 30 that can typically process data faster than a processor can for a given clock frequency, and often at substantially the same rate at which the pipeline is clocked. The pipeline 30 includes operator circuits 32₁-32ₙ that each perform a respective operation on respective data without executing program instructions. That is, the desired operation is “burned in” to a circuit 32 such that it implements the operation automatically, without the need of program instructions. By eliminating the overhead associated with executing program instructions, the pipeline 30 can typically perform more operations per second than a processor can for a given clock frequency.

[0016] For example, the pipeline 30 can often solve the following equation faster than a processor can for a given clock frequency:

Y(x_k) = (5x_k + 3)2^{x_k}

[0017] where x_k represents a sequence of raw data values. In this example, the operator circuit 32₁ is a multiplier that calculates 5x_k, the circuit 32₂ is an adder that calculates 5x_k + 3, and the circuit 32ₙ (n = 3) is a multiplier that calculates (5x_k + 3)2^{x_k}.

[0018] During a first clock cycle k=1, the circuit 32₁ receives data value x_1 and multiplies it by 5 to generate 5x_1.

[0019] During a second clock cycle k=2, the circuit 32₂ receives 5x_1 from the circuit 32₁ and adds 3 to generate 5x_1 + 3. Also, during the second clock cycle, the circuit 32₁ generates 5x_2.

[0020] During a third clock cycle k=3, the circuit 32₃ receives 5x_1 + 3 from the circuit 32₂ and multiplies by 2^{x_1} (effectively left shifts 5x_1 + 3 by x_1) to generate the first result (5x_1 + 3)2^{x_1}. Also during the third clock cycle, the circuit 32₁ generates 5x_3 and the circuit 32₂ generates 5x_2 + 3.

[0021] The pipeline 30 continues processing subsequent raw data values x_k in this manner until all the raw data values are processed.

[0022] Consequently, after a delay of two clock cycles following receipt of a raw data value x_1 (this delay is often called the latency of the pipeline 30), the pipeline generates the result (5x_1 + 3)2^{x_1}, and thereafter generates one result, e.g., (5x_2 + 3)2^{x_2}, (5x_3 + 3)2^{x_3}, . . . , (5x_n + 3)2^{x_n}, each clock cycle.
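
The cycle-by-cycle behavior described above can be modeled in software with the following Python sketch. It is only a simulation of the timing, not a description of the hardwired circuits 32; the names run_pipeline, reg1, and reg2 are hypothetical, and the raw data values are assumed to be non-negative integers so that the final multiplication can be written as a left shift.

```python
def run_pipeline(raw_values):
    """Simulate the three operator stages; return (clock cycle, result) pairs."""
    pending = list(raw_values)
    reg1 = None  # latched output of the first stage: (x, 5x)
    reg2 = None  # latched output of the second stage: (x, 5x + 3)
    results = []
    cycle = 0
    while pending or reg1 is not None or reg2 is not None:
        cycle += 1
        # Stages are evaluated back to front so that each stage reads the
        # value latched by its predecessor during the previous cycle.
        if reg2 is not None:
            x, partial = reg2
            results.append((cycle, partial << x))  # (5x + 3) * 2**x
        reg2 = (reg1[0], reg1[1] + 3) if reg1 is not None else None
        reg1 = None
        if pending:
            x = pending.pop(0)
            reg1 = (x, 5 * x)
    return results


results = run_pipeline([1, 2, 3])
for cycle, value in results:
    print(f"cycle {cycle}: {value}")
```

In this model the first result appears in the third clock cycle, i.e., two cycles after x_1 is received, and one further result appears in each subsequent cycle.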

[0023] Disregarding the latency, the pipeline 30 thus has a data-processing speed equal to the clock speed. In comparison, assuming that the master processor 12 and coprocessors 14 (FIG. 1) have data-processing speeds that are 0.4 times the clock speed as in the above example, the pipeline 30 can process data 2.5 times faster than the computing machine 10 (FIG. 1) for a given clock speed.

[0024] Still referring to FIG. 2, a designer may choose to implement the pipeline 30 in a programmable logic IC (PLIC), such as a field-programmable gate array (FPGA), because a PLIC allows more design and modification flexibility than does an application specific IC (ASIC). To configure the hardwired connections within a PLIC, the designer merely sets interconnection-configuration registers disposed within the PLIC to predetermined binary states. The combination of all these binary states is often called “firmware.” Typically, the designer loads this firmware into a nonvolatile memory (not shown in FIG. 2) that is coupled to the PLIC. When one “turns on” the PLIC, it downloads the firmware from the memory into the interconnection-configuration registers. Therefore, to modify the functioning of the PLIC, the designer merely modifies the firmware and allows the PLIC to download the modified firmware into the interconnection-configuration registers. This ability to modify the PLIC by merely modifying the firmware is particularly useful during the prototyping stage and for upgrading the pipeline 30 “in the field.”

[0025] Unfortunately, the hardwired pipeline 30 typically cannot execute all algorithms, particularly those that entail significant decision making. A processor can typically execute a decision-making instruction (e.g., conditional instructions such as “if A, then go to B, else go to C”) approximately as fast as it can execute an operational instruction (e.g., “A+B”) of comparable length. But although the pipeline 30 may be able to make a relatively simple decision (e.g., “A>B?”), it typically cannot execute a relatively complex decision (e.g., “if A, then go to B, else go to C”). And although one may be able to design the pipeline 30 to execute such a complex decision, the size and complexity of the required circuitry often makes such a design impractical, particularly where an algorithm includes multiple different complex decisions.

[0026] Consequently, processors are typically used in applications that require significant decision making, and hardwired pipelines are typically limited to “number crunching” applications that entail little or no decision making.

[0027] Furthermore, as discussed below, it is typically much easier for one to design/modify a processor-based computing machine, such as the computing machine 10 of FIG. 1, than it is to design/modify a hardwired pipeline such as the pipeline 30 of FIG. 2, particularly where the pipeline 30 includes multiple PLICs.

[0028] Computing components, such as processors and their peripherals (e.g., memory), typically include industry-standard communication interfaces that facilitate the interconnection of the components to form a processor-based computing machine.

[0029] Typically, a standard communication interface includes two layers: a physical layer and a service layer.

[0030] The physical layer includes the circuitry and the corresponding circuit interconnections that form the interface and the operating parameters of this circuitry. For example, the physical layer includes the pins that connect the component to a bus, the buffers that latch data received from the pins, and the drivers that drive data onto the pins. The operating parameters include the acceptable voltage range of the data signals that the pins receive, the signal timing for writing and reading data, and the supported modes of operation (e.g., burst mode, page mode). Conventional physical layers include transistor-transistor logic (TTL) and RAMBUS.

[0031] The service layer includes the protocol by which a computing component transfers data. The protocol defines the format of the data and the manner in which the component sends and receives the formatted data. Conventional communication protocols include file-transfer protocol (FTP) and transmission control protocol/internet protocol (TCP/IP).

[0032] Consequently, because manufacturers and others typically design computing components having industry-standard communication interfaces, one can typically design the interface of such a component and interconnect it to other computing components with relatively little effort. This allows one to devote most of one's time to designing the other portions of the computing machine, and to easily modify the machine by adding or removing components.

[0033] Designing a computing component that supports an industry-standard communication interface allows one to save design time by using an existing physical-layer design from a design library. This also ensures that the designer can easily interface the component to off-the-shelf computing components.

[0034] And designing a computing machine using computing components that support a common industry-standard communication interface allows the designer to interconnect the components with little time and effort. Because the components support a common interface, the designer can interconnect them via a system bus with little design effort. And because the supported interface is an industry standard, one can easily modify the machine. For example, one can add different components and peripherals to the machine as the system design evolves, or can easily add/design next-generation components as the technology evolves. Furthermore, because the components support a common industry-standard service layer, one can incorporate into the computing machine's software an existing software module that implements the corresponding protocol. Therefore, one can interface the components with little effort because the interface design is essentially already in place, and thus can focus on designing the portions (e.g., software) of the machine that cause the machine to perform the desired function(s).

[0035] But unfortunately, there are no known industry-standard communication interfaces for components, such as PLICs, used to form hardwired pipelines such as the pipeline 30 of FIG. 2.

[0036] Consequently, to design a pipeline having multiple PLICs, one typically spends a significant amount of time and exerts a significant effort designing and debugging the communication interface between the PLICs “from scratch.” Typically, such an ad hoc communication interface depends on the parameters of the data being transferred between the PLICs. Likewise, to design a pipeline that interfaces to a processor, one would have to spend a significant amount of time and exert a significant effort in designing and debugging the communication interface between the pipeline and the processor from scratch.

[0037] Similarly, to modify such a pipeline by adding a PLIC to it, one typically spends a significant amount of time and exerts a significant effort designing and debugging the communication interface between the added PLIC and the existing PLICs. Likewise, to modify a pipeline by adding a processor, or to modify a computing machine by adding a pipeline, one would have to spend a significant amount of time and exert a significant effort in designing and debugging the communication interface between the pipeline and processor.

[0038] Consequently, referring to FIGS. 1 and 2, because of the difficulties in interfacing multiple PLICs and in interfacing a processor to a pipeline, one is often forced to make significant tradeoffs when designing a computing machine. For example, with a processor-based computing machine, one is forced to trade number-crunching speed and design/modification flexibility for complex decision-making ability. Conversely, with a hardwired pipeline-based computing machine, one is forced to trade complex-decision-making ability and design/modification flexibility for number-crunching speed. Furthermore, because of the difficulties in interfacing multiple PLICs, it is often impractical for one to design a pipeline-based machine having more than a few PLICs. As a result, a practical pipeline-based machine often has limited functionality. And because of the difficulties in interfacing a processor to a PLIC, it would be impractical to interface a processor to more than one PLIC. As a result, the benefits obtained by combining a processor and a pipeline would be minimal.

[0039] Therefore, a need has arisen for a new computing architecture that allows one to combine the decision-making ability of a processor-based machine with the number-crunching speed of a hardwired-pipeline-based machine.

SUMMARY

[0040] In an embodiment of the invention, a computing machine includes a first buffer and a processor coupled to the buffer. The processor is operable to execute an application, a first data-transfer object, and a second data-transfer object, publish data under the control of the application, load the published data into the buffer under the control of the first data-transfer object, and retrieve the published data from the buffer under the control of the second data-transfer object.

[0041] According to another embodiment of the invention, the processor is operable to retrieve data and load the retrieved data into the buffer under the control of the first data-transfer object, unload the data from the buffer under the control of the second data-transfer object, and process the unloaded data under the control of the application.

[0042] Where the computing machine is a peer-vector machine that includes a hardwired pipeline accelerator coupled to the processor, the buffer and data-transfer objects facilitate the transfer of data, whether unidirectional or bidirectional, between the application and the accelerator.

BRIEF DESCRIPTION OF THE DRAWINGS

[0043] FIG. 1 is a block diagram of a computing machine having a conventional multi-processor architecture.

[0044] FIG. 2 is a block diagram of a conventional hardwired pipeline.

[0045] FIG. 3 is a schematic block diagram of a computing machine having a peer-vector architecture according to an embodiment of the invention.

[0046] FIG. 4 is a functional block diagram of the host processor of FIG. 3 according to an embodiment of the invention.

[0047] FIG. 5 is a functional block diagram of the data-transfer paths between the data-processing application and the pipeline bus of FIG. 4 according to an embodiment of the invention.

[0048] FIG. 6 is a functional block diagram of the data-transfer paths between the accelerator exception manager and the pipeline bus of FIG. 4 according to an embodiment of the invention.

[0049] FIG. 7 is a functional block diagram of the data-transfer paths between the accelerator configuration manager and the pipeline bus of FIG. 4 according to an embodiment of the invention.

DETAILED DESCRIPTION

[0050] FIG. 3 is a schematic block diagram of a computing machine 40, which has a peer-vector architecture according to an embodiment of the invention. In addition to a host processor 42, the peer-vector machine 40 includes a pipeline accelerator 44, which performs at least a portion of the data processing, and which thus effectively replaces the bank of coprocessors 14 in the computing machine 10 of FIG. 1. Therefore, the host processor 42 and the accelerator 44 are “peers” that can transfer data vectors back and forth. Because the accelerator 44 does not execute program instructions, it typically performs mathematically intensive operations on data significantly faster than a bank of coprocessors can for a given clock frequency. Consequently, by combining the decision-making ability of the processor 42 and the number-crunching ability of the accelerator 44, the machine 40 has the same abilities as, but can often process data faster than, a conventional computing machine such as the machine 10. Furthermore, as discussed below and in previously cited U.S. patent application Ser. No. ______ entitled PIPELINE ACCELERATOR FOR IMPROVED COMPUTING ARCHITECTURE AND RELATED SYSTEM AND METHOD (Attorney Docket No. 1934-13-3), providing the accelerator 44 with the same communication interface as the host processor 42 facilitates the design and modification of the machine 40, particularly where the communication interface is an industry standard. And where the accelerator 44 includes multiple components (e.g., PLICs), providing these components with this same communication interface facilitates the design and modification of the accelerator, particularly where the communication interface is an industry standard. Moreover, the machine 40 may also provide other advantages as described below and in the previously cited patent applications.

[0051] Still referring to FIG. 3, in addition to the host processor 42 and the pipeline accelerator 44, the peer-vector computing machine 40 includes a processor memory 46, an interface memory 48, a bus 50, a firmware memory 52, optional raw-data input ports 54 and 56, processed-data output ports 58 and 60, and an optional router 61.

[0052] The host processor 42 includes a processing unit 62 and a message handler 64, and the processor memory 46 includes a processing-unit memory 66 and a handler memory 68, which respectively serve as both program and working memories for the processing unit and the message handler. The processor memory 46 also includes an accelerator-configuration registry 70 and a message-configuration registry 72, which store respective configuration data that allow the host processor 42 to configure the functioning of the accelerator 44 and the structure of the messages that the message handler 64 sends and receives.

[0053] The pipeline accelerator 44 is disposed on at least one PLIC (not shown) and includes hardwired pipelines 74₁-74ₙ, which process respective data without executing program instructions. The firmware memory 52 stores the configuration firmware for the accelerator 44. If the accelerator 44 is disposed on multiple PLICs, these PLICs and their respective firmware memories may be disposed on multiple circuit boards, i.e., daughter cards (not shown). The accelerator 44 and daughter cards are discussed further in previously cited U.S. patent application Ser. Nos. ______ entitled PIPELINE ACCELERATOR FOR IMPROVED COMPUTING ARCHITECTURE AND RELATED SYSTEM AND METHOD (Attorney Docket No. 1934-13-3) and ______ entitled PIPELINE ACCELERATOR HAVING MULTIPLE PIPELINE UNITS AND RELATED COMPUTING MACHINE AND METHOD (Attorney Docket No. 1934-15-3). Alternatively, the accelerator 44 may be disposed on at least one ASIC, and thus may have internal interconnections that are unconfigurable. In this alternative, the machine 40 may omit the firmware memory 52. Furthermore, although the accelerator 44 is shown including multiple pipelines 74, it may include only a single pipeline. In addition, although not shown, the accelerator 44 may include one or more processors such as a digital-signal processor (DSP).

[0054] The general operation of the peer-vector machine 40 is discussed in previously cited U.S. patent application Ser. No. ______ entitled IMPROVED COMPUTING ARCHITECTURE AND RELATED SYSTEM AND METHOD (Attorney Docket No. 1934-11-3), and the functional topology and operation of the host processor 42 are discussed below in conjunction with FIGS. 4-7. FIG. 4 is a functional block diagram of the host processor 42 and the pipeline bus 50 of FIG. 3 according to an embodiment of the invention. Generally, the processing unit 62 executes one or more software applications, and the message handler 64 executes one or more software objects that transfer data between the software application(s) and the pipeline accelerator 44 (FIG. 3). Splitting the data-processing, data-transferring, and other functions among different applications and objects allows for easier design and modification of the host-processor software. Furthermore, although in the following description a software application is described as performing a particular operation, it is understood that in actual operation, the processing unit 62 or message handler 64 executes the software application and performs this operation under the control of the application. Likewise, although in the following description a software object is described as performing a particular operation, it is understood that in actual operation, the processing unit 62 or message handler 64 executes the software object and performs this operation under the control of the object.

[0055] Still referring to FIG. 4, the processing unit 62 executes a data-processing application 80, an accelerator exception manager application (hereinafter the exception manager) 82, and an accelerator configuration manager application (hereinafter the configuration manager) 84, which are collectively referred to as the processing-unit applications. The data-processing application 80 processes data in cooperation with the pipeline accelerator 44 (FIG. 3). For example, the data-processing application 80 may receive raw sonar data via the port 54 (FIG. 3), parse the data, and send the parsed data to the accelerator 44, and the accelerator may perform an FFT on the parsed data and return the processed data to the data-processing application for further processing. The exception manager 82 handles exception messages from the accelerator 44, and the configuration manager 84 loads the accelerator's configuration firmware into the memory 52 during initialization of the peer-vector machine 40 (FIG. 3). The configuration manager 84 may also reconfigure the accelerator 44 after initialization in response to, e.g., a malfunction of the accelerator. As discussed further below in conjunction with FIGS. 6-7, the processing-unit applications may communicate with each other directly as indicated by the dashed lines 85, 87, and 89, or may communicate with each other via the data-transfer objects 86. The message handler 64 executes the data-transfer objects 86, a communication object 88, and input and output reader objects 90 and 92, and may execute input and output queue objects 94 and 96. The data-transfer objects 86 transfer data between the communication object 88 and the processing-unit applications, and may use the interface memory 48 as a data buffer to allow the processing-unit applications and the accelerator 44 to operate independently. For example, the memory 48 allows the accelerator 44, which is often faster than the data-processing application 80, to operate without “waiting” for the data-processing application. The communication object 88 transfers data between the data-transfer objects 86 and the pipeline bus 50. The input and output reader objects 90 and 92 control the data-transfer objects 86 as they transfer data between the communication object 88 and the processing-unit applications. And, when executed, the input and output queue objects 94 and 96 cause the input and output reader objects 90 and 92 to synchronize this transfer of data according to a desired priority.

[0056] Furthermore, during initialization of the peer-vector machine 40 (FIG. 3), the message handler 64 instantiates and executes a conventional object factory 98, which instantiates the data-transfer objects 86 from configuration data stored in the message-configuration registry 72 (FIG. 3). The message handler 64 also instantiates the communication object 88, the input and output reader objects 90 and 92, and the input and output queue objects 94 and 96 from the configuration data stored in the message-configuration registry 72. Consequently, one can design and modify these software objects, and thus their data-transfer parameters, by merely designing or modifying the configuration data stored in the registry 72. This is typically less time consuming than designing or modifying each software object individually.
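
As a purely illustrative sketch of this configuration-driven approach, the following Python fragment builds objects from a registry-like data structure. Everything in it is an assumption made for illustration: the class names, the field names (name, channel, direction), and the dictionary layout do not come from the message-configuration registry 72, whose actual format is not described here.

```python
from dataclasses import dataclass


@dataclass
class DataTransferObject:
    """Hypothetical stand-in for one of the data-transfer objects 86."""
    name: str
    channel: int
    direction: str  # "to_buffer" or "from_buffer" (assumed labels)


class ObjectFactory:
    """Hypothetical stand-in for the object factory 98."""

    def __init__(self, registry):
        self._registry = registry

    def instantiate_all(self):
        # One object is created per registry entry, so changing the
        # configuration data changes the objects without editing any class.
        return [DataTransferObject(**entry)
                for entry in self._registry["data_transfer_objects"]]


# Assumed registry contents describing the two objects of one channel.
registry = {
    "data_transfer_objects": [
        {"name": "86_1a", "channel": 1, "direction": "to_buffer"},
        {"name": "86_1b", "channel": 1, "direction": "from_buffer"},
    ]
}
objects = ObjectFactory(registry).instantiate_all()
```

Adding a channel in this sketch requires only two further registry entries, which is the kind of saving that the paragraph above attributes to modifying configuration data rather than individual objects.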

[0057] The operation of the host processor 42 of FIG. 4 is discussedbelow in conjunction with FIGS. 5-7.

[0058] Data Processing

[0059] FIG. 5 is a functional block diagram of the data-processing application 80, the data-transfer objects 86, and the interface memory 48 of FIG. 4 according to an embodiment of the invention.

[0060] The data-processing application 80 includes a number of threads 100 ₁-100 _(n), which each perform a respective data-processing operation. For example, the thread 100 ₁ may perform an addition, and the thread 100 ₂ may perform a subtraction, or both the threads 100 ₁ and 100 ₂ may perform an addition.

[0061] Each thread 100 generates, i.e., publishes, data destined for the pipeline accelerator 44 (FIG. 3), receives, i.e., subscribes to, data from the accelerator, or both publishes and subscribes to data. For example, each of the threads 100 ₁-100 ₄ both publishes data to and subscribes to data from the accelerator 44. A thread 100 may also communicate directly with another thread 100. For example, as indicated by the dashed line 102, the threads 100 ₃ and 100 ₄ may directly communicate with each other. Furthermore, a thread 100 may receive data from or send data to a component (not shown) other than the accelerator 44 (FIG. 3). But for brevity, discussion of data transfer between the threads 100 and such another component is omitted.

[0062] Still referring to FIG. 5, the interface memory 48 and the data-transfer objects 86 _(1a)-86 _(nb) functionally form a number of unidirectional channels 104 ₁-104 _(n) for transferring data between the respective threads 100 and the communication object 88. The interface memory 48 includes a number of buffers 106 ₁-106 _(n), one buffer per channel 104. The buffers 106 may each hold a single grouping (e.g., byte, word, block) of data, or at least some of the buffers may be FIFO buffers that can each store respective multiple groupings of data. There are also two data-transfer objects 86 per channel 104, one for transferring data between a respective thread 100 and a respective buffer 106, and the other for transferring data between the buffer 106 and the communication object 88. For example, the channel 104 ₁ includes a buffer 106 ₁, a data-transfer object 86 _(1a) for transferring published data from the thread 100 ₁ to the buffer 106 ₁, and a data-transfer object 86 _(1b) for transferring the published data from the buffer 106 ₁ to the communication object 88. Including a respective channel 104 for each allowable data transfer reduces the potential for data bottlenecks and also facilitates the design and modification of the host processor 42 (FIG. 4).
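
For illustration only, the channel arrangement described above may be sketched in Python as follows. The class names Buffer and DataTransferObject, the method names, and the sample data are assumptions made for this sketch and are not part of the described embodiment; the sketch merely shows one FIFO buffer served by two instances of the same transfer code.

```python
from collections import deque


class Buffer:
    """Stands in for one buffer 106 of the interface memory 48 (FIFO variant)."""

    def __init__(self):
        self._items = deque()

    def load(self, item):
        self._items.append(item)

    def unload(self):
        return self._items.popleft()

    def has_data(self):
        return bool(self._items)


class DataTransferObject:
    """Moves one grouping of data from a source callable to a sink callable."""

    def __init__(self, source, sink):
        self._source = source
        self._sink = sink

    def transfer(self):
        self._sink(self._source())


# Channel 104_1: thread 100_1 -> 86_1a -> buffer 106_1 -> 86_1b -> communication object 88
published = deque(["sample-0", "sample-1"])  # hypothetical data published by thread 100_1
sent_to_comm_object = []                     # stands in for the communication object 88
buffer_1 = Buffer()

dto_1a = DataTransferObject(published.popleft, buffer_1.load)
dto_1b = DataTransferObject(buffer_1.unload, sent_to_comm_object.append)

dto_1a.transfer()
dto_1a.transfer()
while buffer_1.has_data():
    dto_1b.transfer()
```

Because the two transfer objects touch only the buffer, the publishing side and the sending side in this sketch can run at different rates, which is the decoupling the paragraph above attributes to the buffers 106.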

[0063] Referring to FIGS. 3-5, the operation of the host processor 42 during its initialization and while executing the data-processing application 80, the data-transfer objects 86, the communication object 88, and the optional reader and queue objects 90, 92, 94, and 96 is discussed according to an embodiment of the invention.

[0064] During initialization of the host processor 42, the object factory 98 instantiates the data-transfer objects 86 and defines the buffers 106. Specifically, the object factory 98 downloads the configuration data from the registry 72 and generates the software code for each data-transfer object 86 _(xb) that the data-processing application 80 may need. The identity of the data-transfer objects 86 _(xb) that the application 80 may need is typically part of the configuration data; the application 80, however, need not use all of the data-transfer objects 86. Then, from the generated objects 86 _(xb), the object factory 98 respectively instantiates the data-transfer objects 86 _(xa). Typically, as discussed in the example below, the object factory 98 instantiates data-transfer objects 86 _(xa) and 86 _(xb) that access the same buffer 106 as multiple instances of the same software code. This reduces the amount of code that the object factory 98 would otherwise generate by approximately one half. Furthermore, the message handler 64 may determine which, if any, data-transfer objects 86 the application 80 does not need, and delete the instances of these unneeded data-transfer objects to save memory. Alternatively, the message handler 64 may make this determination before the object factory 98 generates the data-transfer objects 86, and cause the object factory to instantiate only the data-transfer objects that the application 80 needs. In addition, because the data-transfer objects 86 include the addresses of the interface memory 48 where the respective buffers 106 are located, the object factory 98 effectively defines the sizes and locations of the buffers when it instantiates the data-transfer objects.

[0065] For example, the object factory 98 instantiates the data-transfer objects 86 _(1a) and 86 _(1b) in the following manner. First, the factory 98 downloads the configuration data from the registry 72 and generates the common software code for the data-transfer objects 86 _(1a) and 86 _(1b). Next, the factory 98 instantiates the data-transfer objects 86 _(1a) and 86 _(1b) as respective instances of the common software code. That is, the message handler 64 effectively copies the common software code to two locations of the handler memory 68 or to other program memory (not shown), and executes one location as the object 86 _(1a) and the other location as the object 86 _(1b).
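
As a rough sketch of this instantiation sequence, and not a statement of how the object factory 98 is actually coded, the following Python example generates one common class per channel from configuration data and creates two instances of it. The registry layout (the keys "channel", "buffer_address", and "buffer_size"), the function make_transfer_class, and the needed_channels parameter are hypothetical names chosen for illustration.

```python
def make_transfer_class(channel_config):
    """Generate the common code for one channel from its configuration entry."""

    class TransferObject:
        # The buffer location and size come from the configuration data, so
        # creating the object effectively defines the buffer as well.
        buffer_address = channel_config["buffer_address"]
        buffer_size = channel_config["buffer_size"]

        def __init__(self, role):
            self.role = role  # "a" = thread side, "b" = communication side

    TransferObject.__name__ = f"TransferObject_{channel_config['channel']}"
    return TransferObject


class ObjectFactory:
    def __init__(self, registry):
        self.registry = registry

    def instantiate(self, needed_channels=None):
        """Return {channel: (object_a, object_b)}, skipping unneeded channels."""
        objects = {}
        for cfg in self.registry:
            if needed_channels is not None and cfg["channel"] not in needed_channels:
                continue
            common = make_transfer_class(cfg)  # common code generated once
            objects[cfg["channel"]] = (common("a"), common("b"))  # two instances
        return objects


# Hypothetical configuration data standing in for the registry 72.
registry = [
    {"channel": 1, "buffer_address": 0x0000, "buffer_size": 256},
    {"channel": 2, "buffer_address": 0x0100, "buffer_size": 512},
]
objects = ObjectFactory(registry).instantiate(needed_channels={1})
```

Passing needed_channels corresponds to the alternative in which only the data-transfer objects that the application needs are instantiated.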

[0066] Still referring to FIGS. 3-5, after initialization of the host processor 42, the data-processing application 80 processes data and sends data to and receives data from the pipeline accelerator 44.

[0067] An example of the data-processing application 80 sending data to the accelerator 44 is discussed in conjunction with the channel 104 ₁.

[0068] First, the thread 100 ₁ generates and publishes data to the data-transfer object 86 _(1a). The thread 100 ₁ may generate the data by operating on raw data that it receives from the accelerator 44 (further discussed below) or from another source (not shown) such as a sonar array or a database via the port 54.

[0069] Then, the data-transfer object 86 _(1a) loads the published data into the buffer 106 ₁.

[0070] Next, the data-transfer object 86 _(1b) determines that the buffer 106 ₁ has been loaded with newly published data from the data-transfer object 86 _(1a). The output reader object 92 may periodically instruct the data-transfer object 86 _(1b) to check the buffer 106 ₁ for newly published data. Alternatively, the output reader object 92 notifies the data-transfer object 86 _(1b) when the buffer 106 ₁ has received newly published data. Specifically, the output queue object 96 generates and stores a unique identifier (not shown) in response to the data-transfer object 86 _(1a) storing the published data in the buffer 106 ₁. In response to this identifier, the output reader object 92 notifies the data-transfer object 86 _(1b) that the buffer 106 ₁ contains newly published data. Where multiple buffers 106 contain respective newly published data, then the output queue object 96 may record the order in which this data was published, and the output reader object 92 may notify the respective data-transfer objects 86 _(xb) in the same order. Thus, the output reader object 92 and the output queue object 96 synchronize the data transfer by causing the first data published to be the first data that the respective data-transfer object 86 _(xb) sends to the accelerator 44, the second data published to be the second data that the respective data-transfer object 86 _(xb) sends to the accelerator, etc. In another alternative where multiple buffers 106 contain respective newly published data, the output reader and output queue objects 92 and 96 may implement a priority scheme other than, or in addition to, this first-in-first-out scheme. For example, suppose the thread 100 ₁ publishes first data, and subsequently the thread 100 ₂ publishes second data but also publishes to the output queue object 96 a priority flag associated with the second data. Because the second data has priority over the first data, the output reader object 92 notifies the data-transfer object 86 _(2b) of the published second data in the buffer 106 ₂ before notifying the data-transfer object 86 _(1b) of the published first data in the buffer 106 ₁.
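
One possible way to model this ordering, offered only as a sketch, is a priority heap in which each recorded publication carries a sequence number. The class names, the record and next_channel methods, and the boolean priority flag are assumptions for illustration; the embodiment does not specify how the queue object 96 stores its identifiers.

```python
import heapq
import itertools


class OutputQueueObject:
    """Records one identifier per publication (a sketch of the queue object 96)."""

    def __init__(self):
        self._heap = []
        self._sequence = itertools.count()

    def record(self, channel, priority=False):
        # Smaller tuples pop first: flagged entries (0) precede normal ones (1),
        # and the sequence number preserves first-in-first-out order within
        # each class.
        entry = (0 if priority else 1, next(self._sequence), channel)
        heapq.heappush(self._heap, entry)

    def next_channel(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


class OutputReaderObject:
    """Notifies data-transfer objects in the order the queue dictates."""

    def __init__(self, queue, notify):
        self._queue = queue
        self._notify = notify

    def run(self):
        while (channel := self._queue.next_channel()) is not None:
            self._notify(channel)


notified = []
queue = OutputQueueObject()
queue.record(channel=1)                 # thread 100_1 publishes first data
queue.record(channel=2, priority=True)  # thread 100_2 publishes flagged second data
queue.record(channel=3)
OutputReaderObject(queue, notified.append).run()
```

With no priority flags set, the same structure degenerates to the plain first-in-first-out scheme described above.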

[0071] Then, the data-transfer object 86 _(1b) retrieves the published data from the buffer 106 ₁ and formats the data in a predetermined manner. For example, the object 86 _(1b) generates a message that includes the published data (i.e., the payload) and a header that, e.g., identifies the destination of the data within the accelerator 44. This message may have an industry-standard format such as the Rapid IO (input/output) format. Because the generation of such a message is conventional, it is not discussed further.
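
By way of a simplified illustration, a header-plus-payload message of this general kind might be packed and unpacked as shown below. The header layout used here (a 16-bit destination identifier followed by a 32-bit payload length, big-endian) is purely hypothetical and is not the Rapid IO format; the function names are likewise assumptions.

```python
import struct

# Hypothetical header: 16-bit destination identifier, 32-bit payload length.
HEADER = struct.Struct(">HI")


def format_message(destination, payload):
    """Prefix the payload with a header identifying its destination."""
    return HEADER.pack(destination, len(payload)) + payload


def recover_data(message):
    """Separate the header from the payload, as the receiving side would."""
    destination, length = HEADER.unpack_from(message)
    payload = message[HEADER.size:HEADER.size + length]
    return destination, payload


message = format_message(destination=7, payload=b"\x01\x02\x03")
```

The recover_data function mirrors the step in which the receiver strips the header before directing the data to its destination.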

[0072] After the data-transfer object 86 _(1b) formats the published data, it sends the formatted data to the communication object 88.

[0073] Next, the communication object 88 sends the formatted data to the pipeline accelerator 44 via the bus 50. The communication object 88 is designed to implement the communication protocol (e.g., Rapid IO, TCP/IP) used to transfer data between the host processor 42 and the accelerator 44. For example, the communication object 88 implements the required handshaking and other transfer parameters (e.g., arbitrating the sending and receiving of messages on the bus 50) that the protocol requires. Alternatively, the data-transfer object 86 _(xb) can implement the communication protocol, and the communication object 88 can be omitted. However, this latter alternative is less efficient because it requires all the data-transfer objects 86 _(xb) to include additional code and functionality.

[0074] The pipeline accelerator 44 then receives the formatted data, recovers the data from the message (e.g., separates the data from the header if there is a header), directs the data to the proper destination within the accelerator, and processes the data.

[0075] Still referring to FIGS. 3-5, an example of the pipeline accelerator 44 (FIG. 3) sending data to the host processor 42 (FIG. 3) is discussed in conjunction with the channel 104 ₂.

[0076] First, the pipeline accelerator 44 generates and formats data. For example, the accelerator 44 generates a message that includes the data payload and a header that, e.g., identifies the destination threads 100 ₁ and 100 ₂, which are the threads that are to receive and process the data. As discussed above, this message may have an industry-standard format such as the Rapid IO (input/output) format.

[0077] Next, the accelerator 44 drives the formatted data onto the bus 50 in a conventional manner.

[0078] Then, the communication object 88 receives the formatted data from the bus 50 and provides the formatted data to the data-transfer object 86 _(2b). In one embodiment, the formatted data is in the form of a message, and the communication object 88 analyzes the message header (which, as discussed above, identifies the destination threads 100 ₁ and 100 ₂) and provides the message to the data-transfer object 86 _(2b) in response to the header. In another embodiment, the communication object 88 provides the message to all of the data-transfer objects 86 _(nb), each of which analyzes the message header and processes the message only if its function is to provide data to the destination threads 100 ₁ and 100 ₂. Consequently, in this example, only the data-transfer object 86 _(2b) processes the message.
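
The two dispatch embodiments just described can be contrasted in a short sketch. The dictionary-based message, the InboundTransferObject class, and the object name "86_4b" are hypothetical; the sketch assumes a header that lists destination thread names.

```python
class InboundTransferObject:
    """Sketch of a data-transfer object 86_xb that serves a set of threads."""

    def __init__(self, name, serves):
        self.name = name
        self.serves = frozenset(serves)
        self.accepted = []

    def offer(self, message):
        # Broadcast embodiment: process the message only if every destination
        # thread in the header is one this object provides data to.
        if set(message["destinations"]) <= self.serves:
            self.accepted.append(message["payload"])
            return True
        return False


def route_by_header(message, objects):
    """First embodiment: the communication object reads the header itself."""
    for obj in objects:
        if set(message["destinations"]) <= obj.serves:
            obj.accepted.append(message["payload"])
            return obj.name
    return None


def broadcast(message, objects):
    """Second embodiment: every object receives the message and filters it."""
    return [obj.name for obj in objects if obj.offer(message)]


objects = [
    InboundTransferObject("86_2b", serves=["100_1", "100_2"]),
    InboundTransferObject("86_4b", serves=["100_4"]),
]
message = {"destinations": ["100_1", "100_2"], "payload": "fft-result"}
routed_to = route_by_header(message, objects)
processed_by = broadcast(message, objects)
```

In both variants only the object serving the threads 100 ₁ and 100 ₂ ends up processing the message; the difference lies in where the header is examined.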

[0079] Next, the data-transfer object 86 _(2b) loads the data received from the communication object 88 into the buffer 106 ₂. For example, if the data is contained within a message payload, the data-transfer object 86 _(2b) recovers the data from the message (e.g., by stripping the header) and loads the recovered data into the buffer 106 ₂.

[0080] Then, the data-transfer object 86 _(2a) determines that the buffer 106 ₂ has received new data from the data-transfer object 86 _(2b). The input reader object 90 may periodically instruct the data-transfer object 86 _(2a) to check the buffer 106 ₂ for newly received data. Alternatively, the input reader object 90 notifies the data-transfer object 86 _(2a) when the buffer 106 ₂ has received newly published data. Specifically, the input queue object 94 generates and stores a unique identifier (not shown) in response to the data-transfer object 86 _(2b) storing the published data in the buffer 106 ₂. In response to this identifier, the input reader object 90 notifies the data-transfer object 86 _(2a) that the buffer 106 ₂ contains newly published data. As discussed above in conjunction with the output reader and output queue objects 92 and 96, where multiple buffers 106 contain respective newly published data, then the input queue object 94 may record the order in which this data was published, and the input reader object 90 may notify the respective data-transfer objects 86 _(xa) in the same order. Alternatively, where multiple buffers 106 contain respective newly published data, the input reader and input queue objects 90 and 94 may implement a priority scheme other than, or in addition to, this first-in-first-out scheme. Then, the data-transfer object 86 _(2a) transfers the data from the buffer 106 ₂ to the threads 100 ₁ and 100 ₂, which perform respective operations on the data. Still referring to FIG. 5, an example of one thread receiving and processing data from another thread is discussed in conjunction with the thread 100 ₄ receiving and processing data published by the thread 100 ₃. In one embodiment, the thread 100 ₃ publishes the data directly to the thread 100 ₄ via the optional connection (dashed line) 102. In another embodiment, the thread 100 ₃ publishes the data to the thread 100 ₄ via the channels 104 ₅ and 104 ₆. Specifically, the data-transfer object 86 _(5a) loads the published data into the buffer 106 ₅. Next, the data-transfer object 86 _(5b) retrieves the data from the buffer 106 ₅ and transfers the data to the communication object 88, which provides the data to the data-transfer object 86 _(6b). Then, the data-transfer object 86 _(6b) loads the data into the buffer 106 ₆. Next, the data-transfer object 86 _(6a) transfers the data from the buffer 106 ₆ to the thread 100 ₄. Alternatively, because the data is not transferred via the bus 50, one may modify the data-transfer object 86 _(5b) such that it loads the data directly into the buffer 106 ₆, thus bypassing the communication object 88 and the data-transfer object 86 _(6b). But modifying the data-transfer object 86 _(5b) to be different from the other data-transfer objects 86 may increase the complexity and reduce the modularity of the message handler 64. Still referring to FIG. 5, additional data-transfer techniques are contemplated. For example, a single thread may publish data to multiple locations within the pipeline accelerator 44 (FIG. 3) via respective multiple channels. Alternatively, as discussed in previously cited U.S. patent application Ser. Nos. ______ entitled IMPROVED COMPUTING ARCHITECTURE AND RELATED SYSTEM AND METHOD (Attorney Docket No. 1934-11-3) and ______ entitled PIPELINE ACCELERATOR FOR IMPROVED COMPUTING ARCHITECTURE AND RELATED SYSTEM AND METHOD (Attorney Docket No. 1934-13-3), the accelerator 44 may receive data via a single channel 104 and provide it to multiple locations within the accelerator. Furthermore, multiple threads (e.g., threads 100 ₁ and 100 ₂) may subscribe to data from the same channel (e.g., channel 104 ₂). In addition, multiple threads (e.g., threads 100 ₂ and 100 ₃) may publish data to the same location within the accelerator 44 via the same channel (e.g., channel 104 ₃), although the threads may publish data to the same accelerator location via respective channels 104.

[0081] FIG. 6 is a functional block diagram of the exception manager 82, the data-transfer objects 86, and the interface memory 48 according to an embodiment of the invention.

[0082] The exception manager 82 receives and logs exceptions that may occur during the initialization or operation of the pipeline accelerator 44 (FIG. 3). Generally, an exception is a designer-defined event where the accelerator 44 acts in an undesired manner. For example, a buffer (not shown) that overflows may be an exception, and thus cause the accelerator 44 to generate an exception message and send it to the exception manager 82. Generation of an exception message is discussed in previously cited U.S. patent application Ser. No. ______ entitled PIPELINE ACCELERATOR FOR IMPROVED COMPUTING ARCHITECTURE AND RELATED SYSTEM AND METHOD (Attorney Docket No. 1934-13-3).

[0083] The exception manager 82 may also handle exceptions that occur during the initialization or operation of the pipeline accelerator 44 (FIG. 3). For example, if the accelerator 44 includes a buffer (not shown) that overflows, then the exception manager 82 may cause the accelerator to increase the size of the buffer to prevent future overflow. Or, if a section of the accelerator 44 malfunctions, the exception manager 82 may cause another section of the accelerator or the data-processing application 80 to perform the operation that the malfunctioning section was intended to perform. Such exception handling is further discussed below and in previously cited U.S. patent application Ser. No. ______ entitled PIPELINE ACCELERATOR FOR IMPROVED COMPUTING ARCHITECTURE AND RELATED SYSTEM AND METHOD (Attorney Docket No. 1934-13-3).

[0084] To log and/or handle accelerator exceptions, the exception manager 82 subscribes to data from one or more subscriber threads 100 (FIG. 5) and determines from this data whether an exception has occurred.

[0085] In one alternative, the exception manager 82 subscribes to the same data as the subscriber threads 100 (FIG. 5) subscribe to. Specifically, the manager 82 receives this data via the same respective channels 104 _(s) (which include, e.g., channel 104 ₂ of FIG. 5) from which the subscriber threads 100 (which include, e.g., threads 100 ₁ and 100 ₂ of FIG. 5) receive the data. Consequently, the channels 104 _(s) provide this data to the exception manager 82 in the same manner that they provide this data to the subscriber threads 100.

[0086] In another alternative, the exception manager 82 subscribes to data from dedicated channels 104 (not shown), which may receive data from sections of the accelerator 44 (FIG. 3) that do not provide data to the threads 100 via the subscriber channels 104 _(s). Where such dedicated channels 104 are used, the object factory 98 (FIG. 4) generates the data-transfer objects 86 for these channels during initialization of the host processor 42 as discussed above in conjunction with FIG. 4. The exception manager 82 may subscribe to the dedicated channels 104 exclusively or in addition to the subscriber channels 104 _(s).

[0087] To determine whether an exception has occurred, the exception manager 82 compares the data to exception codes stored in a registry (not shown) within the memory 66 (FIG. 3). If the data matches one of the codes, then the exception manager 82 determines that the exception corresponding to the matched code has occurred.

[0088] In another alternative, the exception manager 82 analyzes the data to determine if an exception has occurred. For example, the data may represent the result of an operation performed by the accelerator 44. The exception manager 82 determines whether the data contains an error, and, if so, determines that an exception has occurred and the identity of the exception.

[0089] After determining that an exception has occurred, the exception manager 82 logs, e.g., the corresponding exception code and the time of occurrence, for later use such as during a debug of the accelerator 44. The exception manager 82 may also determine and convey the identity of the exception to, e.g., the system designer, in a conventional manner.
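
The code-matching and logging behavior described above might be sketched as follows. The code values, the exception names, the ExceptionManager class, and its inspect method are all hypothetical and serve only to illustrate the comparison against a registry of codes and the recording of the code and time of occurrence.

```python
import time

# Hypothetical registry of exception codes; the actual codes are designer-defined.
EXCEPTION_CODES = {0x01: "buffer overflow", 0x02: "section malfunction"}


class ExceptionManager:
    def __init__(self, codes, clock=time.time):
        self._codes = codes
        self._clock = clock  # injectable so the log can be checked deterministically
        self.log = []

    def inspect(self, data):
        """Return the exception name if the data matches a code, else None."""
        name = self._codes.get(data)
        if name is None:
            return None
        # Log the code and the time of occurrence for later debugging.
        self.log.append((self._clock(), data, name))
        return name


manager = ExceptionManager(EXCEPTION_CODES, clock=lambda: 123.0)
ordinary = manager.inspect(0x7F)  # ordinary data, no matching code
detected = manager.inspect(0x01)  # matches the hypothetical overflow code
```

The alternative in which the manager analyzes result data for errors would replace the dictionary lookup with an application-specific check.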

[0090] Alternatively, in addition to logging the exception, the exception manager 82 may implement an appropriate procedure for handling the exception. For example, the exception manager 82 may handle the exception by sending an exception-handling instruction to the accelerator 44, the data-processing application 80, or the configuration manager 84. The exception manager 82 may send the exception-handling instruction to the accelerator 44 either via the same respective channels 104 _(p) (e.g., channel 104 ₁ of FIG. 5) through which the publisher threads 100 (e.g., thread 100 ₁ of FIG. 5) publish data, or through dedicated exception-handling channels 104 (not shown) that operate as described above in conjunction with FIG. 5. If the exception manager 82 sends instructions via other channels 104, then the object factory 98 (FIG. 4) generates the data-transfer objects 86 for these channels during initialization of the host processor 42 as described above in conjunction with FIG. 4. The exception manager 82 may publish exception-handling instructions to the data-processing application 80 and to the configuration manager 84 either directly (as indicated by the dashed lines 85 and 89 in FIG. 4) or via the channels 104 _(dpa1) and 104 _(dpa2) (application 80) and channels 104 _(cm1) and 104 _(cm2) (configuration manager 84), which the object factory 98 also generates during the initialization of the host processor 42.

[0091] Still referring to FIG. 6, as discussed below, the exception-handling instructions may cause the accelerator 44, data-processing application 80, or configuration manager 84 to handle the corresponding exception in a variety of ways.

[0092] When sent to the accelerator 44, the exception-handling instruction may change the soft configuration or the functioning of the accelerator. For example, as discussed above, if the exception is a buffer overflow, the instruction may change the accelerator's soft configuration (i.e., by changing the contents of a soft configuration register) to increase the size of the buffer. Or, if a section of the accelerator 44 that performs a particular operation is malfunctioning, the instruction may change the accelerator's functioning by causing the accelerator to take the disabled section “off line.” In this latter case, the exception manager 82 may, via additional instructions, cause another section of the accelerator 44, or the data-processing application 80, to “take over” the operation from the disabled accelerator section as discussed below. Altering the soft configuration of the accelerator 44 is further discussed in previously cited U.S. patent application Ser. No. ______ entitled PIPELINE ACCELERATOR FOR IMPROVED COMPUTING ARCHITECTURE AND RELATED SYSTEM AND METHOD (Attorney Docket No. 1934-13-3).

[0093] When sent to the data-processing application 80, the exception-handling instructions may cause the data-processing application to “take over” the operation of a disabled section of the accelerator 44 that has been taken off line. Although the processing unit 62 (FIG. 3) may perform this operation more slowly and less efficiently than the accelerator 44, this may be preferable to not performing the operation at all. This ability to shift the performance of an operation from the accelerator 44 to the processing unit 62 increases the flexibility, reliability, maintainability, and fault-tolerance of the peer-vector machine 40 (FIG. 3).

[0094] And when sent to the configuration manager 84, the exception-handling instruction may cause the configuration manager to change the hard configuration of the accelerator 44 so that the accelerator can continue to perform the operation of a malfunctioning section that has been taken off line. For example, if the accelerator 44 has an unused section, then the configuration manager 84 may configure this unused section to perform the operation that was to be performed by the malfunctioning section. If the accelerator 44 has no unused section, then the configuration manager 84 may reconfigure a section of the accelerator that currently performs a first operation to perform a second operation of, i.e., take over for, the malfunctioning section. This technique may be useful where the first operation can be omitted but the second operation cannot, or where the data-processing application 80 is more suited to perform the first operation than it is the second operation. This ability to shift the performance of an operation from one section of the accelerator 44 to another section of the accelerator increases the flexibility, reliability, maintainability, and fault-tolerance of the peer-vector machine 40 (FIG. 3).
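
The selection logic implied by the preceding paragraph (prefer an unused section; otherwise reassign a section whose current operation can be omitted) can be sketched as below. The section table, its "operation" and "omittable" fields, and the function choose_replacement are illustrative assumptions and are not drawn from the described embodiment.

```python
def choose_replacement(sections, failed):
    """Return the name of the section that should take over for `failed`, or None."""
    candidates = {name: s for name, s in sections.items() if name != failed}
    # First choice: a section that currently performs no operation.
    for name, section in candidates.items():
        if section["operation"] is None:
            return name
    # Second choice: a section whose current operation can be omitted.
    for name, section in candidates.items():
        if section["omittable"]:
            return name
    return None


# Hypothetical description of three accelerator sections.
sections = {
    "fft": {"operation": "fft", "omittable": False},
    "filter": {"operation": "filter", "omittable": True},
    "spare": {"operation": None, "omittable": False},
}
replacement = choose_replacement(sections, failed="fft")
```

When the function returns None, a fallback such as shifting the operation to the data-processing application 80 would apply.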

[0095] Referring to FIG. 7, the configuration manager 84 loads the firmware that defines the hard configuration of the accelerator 44 during initialization of the peer-vector machine 40 (FIG. 3), and, as discussed above in conjunction with FIG. 6, may load firmware that redefines the hard configuration of the accelerator in response to an exception according to an embodiment of the invention. As discussed below, the configuration manager 84 often reduces the complexity of designing and modifying the accelerator 44 and increases the fault-tolerance, reliability, maintainability, and flexibility of the peer-vector machine 40 (FIG. 3).

[0096] During initialization of the peer-vector machine 40, the configuration manager 84 receives configuration data from the accelerator configuration registry 70, and loads configuration firmware identified by the configuration data. The configuration data are effectively instructions to the configuration manager 84 for loading the firmware. For example, if a section of the initialized accelerator 44 performs an FFT, then one designs the configuration data so that the firmware loaded by the manager 84 implements an FFT in this section of the accelerator. Consequently, one can modify the hard configuration of the accelerator 44 by merely generating or modifying the configuration data before initialization of the peer-vector machine 40. Because generating and modifying the configuration data is often easier than generating and modifying the firmware directly (particularly if the configuration data can instruct the configuration manager 84 to load existing firmware from a library), the configuration manager 84 typically reduces the complexity of designing and modifying the accelerator 44.

[0097] Before the configuration manager 84 loads the firmware identified by the configuration data, the configuration manager determines whether the accelerator 44 can support the configuration defined by the configuration data. For example, if the configuration data instructs the configuration manager 84 to load firmware for a particular PLIC (not shown) of the accelerator 44, then the configuration manager 84 confirms that the PLIC is present before loading the firmware. If the PLIC is not present, then the configuration manager 84 halts the initialization of the accelerator 44 and notifies an operator that the accelerator does not support the configuration.

[0098] After the configuration manager 84 confirms that the accelerator supports the defined configuration, the configuration manager loads the firmware into the accelerator 44, which sets its hard configuration with the firmware, e.g., by loading the firmware into the firmware memory 52. Typically, the configuration manager 84 sends the firmware to the accelerator 44 via one or more channels 104 _(t) that are similar in generation, structure, and operation to the channels 104 of FIG. 5. The configuration manager 84 may also receive data from the accelerator 44 via one or more channels 104 _(u). For example, the accelerator 44 may send confirmation of the successful setting of its hard configuration to the configuration manager 84.
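
The check-then-load sequence described above can be illustrated with a brief sketch. The configuration-entry keys ("plic" and "firmware"), the firmware library dictionary, the send callback standing in for the channels 104 _(t), and the exception class are all hypothetical names introduced for this example.

```python
class UnsupportedConfiguration(Exception):
    """Raised when the accelerator lacks a PLIC named in the configuration data."""


def load_configuration(config_data, present_plics, firmware_library, send):
    # Verify every target first, so that an unsupported configuration halts
    # initialization before any firmware has been loaded.
    missing = [e["plic"] for e in config_data if e["plic"] not in present_plics]
    if missing:
        raise UnsupportedConfiguration(f"PLICs not present: {missing}")
    for entry in config_data:
        # Load existing firmware from the library and send it to the accelerator.
        send(entry["plic"], firmware_library[entry["firmware"]])


sent = []
library = {"fft_v1": b"\xAA\xBB"}  # hypothetical firmware image
load_configuration(
    [{"plic": "plic_0", "firmware": "fft_v1"}],
    present_plics={"plic_0"},
    firmware_library=library,
    send=lambda plic, firmware: sent.append((plic, firmware)),
)
```

In this sketch, raising the exception plays the role of halting initialization and notifying the operator.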

[0099] After the hard configuration of the accelerator 44 is set, the configuration manager 84 may change the accelerator's hard configuration in response to an exception-handling instruction from the exception manager 82 as discussed above in conjunction with FIG. 6. In response to the exception-handling instruction, the configuration manager 84 downloads the appropriate configuration data from the registry 70, loads reconfiguration firmware identified by the configuration data, and sends the firmware to the accelerator 44 via the channels 104 _(t). The configuration manager 84 may receive confirmation of successful reconfiguration from the accelerator 44 via the channels 104 _(u). As discussed above in conjunction with FIG. 6, the configuration manager 84 may receive the exception-handling instruction directly from the exception manager 82 via the line 89 (FIG. 4) or indirectly via the channels 104 _(cm1) and 104 _(cm2).

[0100] The configuration manager 84 may also reconfigure the data-processing application 80 in response to an exception-handling instruction from the exception manager 82 as discussed above in conjunction with FIG. 6. In response to the exception-handling instruction, the configuration manager 84 instructs the data-processing application 80 to reconfigure itself to perform an operation that, due to malfunction or other reason, the accelerator 44 cannot perform. The configuration manager 84 may so instruct the data-processing application 80 directly via the line 87 (FIG. 4) or indirectly via channels 104 _(dp1) and 104 _(dp2), and may receive information from the data-processing application, such as confirmation of successful reconfiguration, directly or via another channel 104 (not shown). Alternatively, the exception manager 82 may send an exception-handling instruction to the data-processing application 80, which reconfigures itself, thus bypassing the configuration manager 84.

[0101] Still referring to FIG. 7, alternate embodiments of the configuration manager 84 are contemplated. For example, the configuration manager 84 may reconfigure the accelerator 44 or the data-processing application 80 for reasons other than the occurrence of an accelerator malfunction.

[0102] The preceding discussion is presented to enable a person skilled in the art to make and use the invention. Various modifications to the embodiments will be readily apparent to those skilled in the art, and the generic principles herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

What is claimed is:
 1. A computing machine, comprising: a first buffer;a processor coupled to the buffer and operable to, execute anapplication, a first data-transfer object, and a second data-transferobject, publish data under the control of the application, load thepublished data into the buffer under the control of the firstdata-transfer object, and retrieve the published data from the bufferunder the control of the second data-transfer object.
 2. The computingmachine of claim 1 wherein the first and second data-transfer objectsrespectively comprise first and second instances of the same objectcode.
 3. The computing machine of claim 1 wherein the processorcomprises: a processing unit operable to execute the application andpublish the data under the control of the application; and adata-transfer handler operable to execute the first and seconddata-transfer objects, to load the published data into the buffer underthe control of the first data-transfer object, and to retrieve thepublished data under the control of the second data-transfer object. 4.The computing machine of claim 1 wherein the processor is furtheroperable to execute a thread of the application and to publish the dataunder the control of the thread.
 5. The computing machine of claim 1wherein the processor is further operable to: execute a queue object anda reader object; store a queue value under the control of the queueobject, the queue value reflecting the loading of the published datainto the buffer; read the queue value under the control of the readerobject; notify the second software object that the published dataoccupies the buffer under the control of the reader object and inresponse to the queue value; and retrieve the published data from thestorage location under the control of the second data-transfer objectand in response to the notification.
 6. The computing machine of claim1, further comprising: a bus; and wherein the processor is operable toexecute an communication object and to drive the retrieved data onto thebus under the control of the communication object.
 7. The computingmachine of claim 1, further comprising: a second buffer; and wherein theprocessor is operable to provide the retrieved data to the second bufferunder the control of the second data-transfer object.
 8. The computingmachine of claim 1 wherein the processor is further operable to generatea message that includes a header and the retrieved data under thecontrol of the second data-transfer object.
 9. The computing machine ofclaim 1 wherein: the first and second data-transfer objects respectivelycomprise first and second instances of the same object code; and theprocessor is operable to execute an object factory and to generate theobject code under the control of the object factory.
10. A computing machine, comprising: a first buffer; and a processor coupled to the buffer and operable to, execute first and second data-transfer objects and an application, retrieve data and load the retrieved data into the buffer under the control of the first data-transfer object, unload the data from the buffer under the control of the second data-transfer object, and process the unloaded data under the control of the application.
11. The computing machine of claim 10 wherein the first and second data-transfer objects respectively comprise first and second instances of the same object code.
12. The computing machine of claim 10 wherein the processor comprises: a processing unit operable to execute the application and process the unloaded data under the control of the application; and a data-transfer handler operable to execute the first and second data-transfer objects, to retrieve the data from the bus and load the data into the buffer under the control of the first data-transfer object, and to unload the data from the buffer under the control of the second data-transfer object.
13. The computing machine of claim 10 wherein the processor is further operable to execute a thread of the application and to process the unloaded data under the control of the thread.
14. The computing machine of claim 10 wherein the processor is further operable to: execute a queue object and a reader object; store a queue value under the control of the queue object, the queue value reflecting the loading of the data into the first buffer; read the queue value under the control of the reader object; notify the second data-transfer object that the data occupies the buffer under the control of the reader object and in response to the queue value; and unload the data from the buffer under the control of the second data-transfer object and in response to the notification.
15. The computing machine of claim 10, further comprising: a second buffer; and wherein the processor is operable to retrieve the data from the second buffer under the control of the first data-transfer object.
16. The computing machine of claim 10, further comprising: a bus; and wherein the processor is operable to execute a communication object, to receive the data from the bus under the control of the communication object, and to retrieve the data from the communication object under the control of the first data-transfer object.
17. The computing machine of claim 10 wherein: the first and second data-transfer objects respectively comprise first and second instances of the same object code; and the processor is operable to execute an object factory and to generate the object code under the control of the object factory.
18. The computing machine of claim 10 wherein the processor is further operable to recover the data from a message that includes a header and the data under the control of the first data-transfer object.
19. A peer-vector machine, comprising: a buffer; a bus; a processor coupled to the buffer and to the bus and operable to, execute an application, first and second data-transfer objects, and a communication object, publish data under the control of the application, load the published data into the buffer under the control of the first data-transfer object, retrieve the published data from the buffer under the control of the second data-transfer object, and drive the published data onto the bus under the control of the communication object; and a pipeline accelerator coupled to the bus and operable to receive the published data from the bus and to process the received published data.
20. The peer-vector machine of claim 19 wherein: the processor is further operable to construct a message that includes the published data under the control of the second data-transfer object and to drive the message onto the bus under the control of the communication object; and the pipeline accelerator is operable to receive the message from the bus and to recover the published data from the message.
21. The peer-vector machine of claim 19, further comprising: a registry coupled to the host processor and operable to store object data; and wherein the processor is operable to, execute an object factory, and to generate the first and second data-transfer objects and the communication object from the object data under the control of the object factory.
22. A peer-vector machine, comprising: a buffer; a bus; a pipeline accelerator coupled to the bus and operable to generate data and to drive the data onto the bus; and a processor coupled to the buffer and to the bus and operable to, execute an application, first and second data-transfer objects, and a communication object, receive the data from the bus under the control of the communication object, load the received data into the buffer under the control of the first data-transfer object, unload the data from the buffer under the control of the second data-transfer object, and process the unloaded data under the control of the application.
23. The peer-vector machine of claim 22 wherein: the pipeline accelerator is further operable to construct a message that includes the data and to drive the message onto the bus; and the processor is operable to, receive the message from the bus under the control of the communication object, and recover the data from the message under the control of the first data-transfer object.
24. The peer-vector machine of claim 22, further comprising: a registry coupled to the host processor and operable to store object data; and wherein the processor is operable to, execute an object factory, and to generate the first and second data-transfer objects and the communication object from the object data under the control of the object factory.
25. A peer-vector machine, comprising: a first buffer; a bus; a processor coupled to the buffer and to the bus and operable to, execute a configuration manager, first and second data-transfer objects, and a communication object, load configuration firmware into the buffer under the control of the configuration manager and the first data-transfer object, retrieve the configuration firmware from the buffer under the control of the second data-transfer object, and drive the configuration firmware onto the bus under the control of the communication object; and a pipeline accelerator coupled to the bus and operable to receive the configuration firmware and to configure itself with the configuration firmware.
26. The peer-vector machine of claim 25 wherein: the processor is further operable to construct a message that includes the configuration firmware under the control of the second data-transfer object and to drive the message onto the bus under the control of the communication object; and the pipeline accelerator is operable to receive the message from the bus and to recover the configuration firmware from the message.
27. The peer-vector machine of claim 25, further comprising: a registry coupled to the processor and operable to store configuration data; and wherein the processor is operable to locate the configuration firmware from the configuration data under the control of the configuration manager.
28. The peer-vector machine of claim 25, further comprising: a second buffer; and wherein the processor is operable to: execute an application and third and fourth data-transfer objects, generate a configuration instruction under the control of the configuration manager, load the configuration instruction into the second buffer under the control of the third data-transfer object, retrieve the configuration instruction from the second buffer under the control of the fourth data-transfer object, and configure the application to perform an operation corresponding to the configuration instruction under the control of the application.
29. The peer-vector machine of claim 25 wherein the processor is operable to: generate a configuration instruction under the control of the configuration manager; and configure the application to perform an operation corresponding to the configuration instruction under the control of the application.
30. The peer-vector machine of claim 25 wherein the configuration manager is operable to confirm that the pipeline accelerator supports a configuration defined by the configuration data before loading the firmware.
31. A peer-vector machine, comprising: a first buffer; a bus; a pipeline accelerator coupled to the bus and operable to generate exception data and to drive the exception data onto the bus; and a processor coupled to the buffer and to the bus and operable to, execute an exception manager, first and second data-transfer objects, and a communication object, receive the exception data from the bus under the control of the communication object, load the received exception data into the buffer under the control of the first data-transfer object, unload the exception data from the buffer under the control of the second data-transfer object, and process the unloaded exception data under the control of the exception manager.
32. The peer-vector machine of claim 31 wherein: the pipeline accelerator is further operable to construct a message that includes the exception data and to drive the message onto the bus; and the processor is operable to receive the message from the bus under the control of the communication object and to recover the exception data from the message under the control of the first data-transfer object.
33. The peer-vector machine of claim 31, further comprising: a second buffer; wherein the processor is further operable to, execute a configuration manager and third and fourth data-transfer objects, generate configuration firmware under the control of the configuration manager in response to the exception data, load the configuration firmware into the second buffer under the control of the third data-transfer object, unload the configuration firmware from the second buffer under the control of the fourth data-transfer object, and drive the configuration firmware onto the bus under the control of the communication object; and wherein the pipeline accelerator is operable to receive the configuration firmware from the bus and reconfigure itself with the firmware.
34. The peer-vector machine of claim 31 wherein the processor is further operable to: execute an application and a configuration manager; generate a configuration instruction under the control of the configuration manager in response to the exception data; and reconfigure the application under the control of the application in response to the configuration instruction.
35. A peer-vector machine, comprising: a configuration registry operable to store configuration data; a processor coupled to the configuration registry and operable to locate configuration firmware from the configuration data; and a pipeline accelerator coupled to the processor and operable to configure itself with the configuration firmware.
36. A peer-vector machine, comprising: a configuration registry operable to store configuration data; a pipeline accelerator; and a processor coupled to the configuration registry and to the pipeline accelerator and operable to retrieve configuration firmware in response to the configuration data and to configure the pipeline accelerator with the configuration firmware.
37. A method, comprising: publishing data with an application; loading the published data into a first buffer with a first data-transfer object; and retrieving the published data from the buffer with a second data-transfer object.
38. The method of claim 37 wherein publishing the data comprises publishing the data with a thread of the application.
39. The method of claim 37, further comprising: generating a queue value that corresponds to the presence of the published data in the buffer; notifying the second data-transfer object that the published data occupies the buffer in response to the queue value; and wherein retrieving the published data comprises retrieving the published data from the buffer with the second data-transfer object in response to the notification.
40. The method of claim 37, further comprising driving the retrieved data onto a bus with a communication object.
41. The method of claim 37, further comprising loading the retrieved data into a second buffer with the second data-transfer object.
42. The method of claim 37, further comprising: generating a header for the retrieved data with the second data-transfer object; and combining the header and the retrieved data into a message with the second data-transfer object.
43. The method of claim 37, further comprising: generating data-transfer object code with an object factory; generating the first data-transfer object as a first instance of the object code; and generating the second data-transfer object as a second instance of the object code.
44. The method of claim 37, further comprising receiving and processing the data from the second data-transfer object with a pipeline accelerator.
45. A method, comprising: retrieving data and loading the retrieved data into a first buffer with a first data-transfer object; unloading the data from the buffer with a second data-transfer object; and processing the unloaded data with an application.
46. The method of claim 45 wherein processing the unloaded data comprises processing the unloaded data with a thread of the application.
47. The method of claim 45, further comprising: generating a queue value that corresponds to the presence of the data in the buffer; notifying the second data-transfer object that the data occupies the buffer in response to the queue value; and wherein unloading the data comprises unloading the data from the buffer with the second data-transfer object in response to the notification.
48. The method of claim 45 wherein retrieving the data comprises retrieving the data from a second buffer with the first data-transfer object.
49. The method of claim 45, further comprising: receiving the data from a bus with a communication object; and wherein retrieving the data comprises retrieving the data from the communication object with the first data-transfer object.
50. The method of claim 45, further comprising providing the data to the first data-transfer object with a pipeline accelerator.
51. A method, comprising: publishing data with an application running on a processor; loading the published data into a buffer with a first data-transfer object running on the processor; retrieving the published data from the buffer with a second data-transfer object running on the processor; driving the retrieved published data onto a bus with a communication object running on the processor; and receiving the published data from the bus and processing the published data with a pipeline accelerator.
52. The method of claim 51, further comprising: generating a message that includes a header and the published data with the second data-transfer object; wherein driving the data onto the bus comprises driving the message onto the bus with the communication object; and receiving and processing the published data comprises receiving the message and recovering the published data from the message with the pipeline accelerator.
53. A method, comprising: generating data and driving the data onto a bus with a pipeline accelerator; receiving the data from the bus with a communication object; loading the received data into a buffer with a first data-transfer object; unloading the data from the buffer with a second data-transfer object; and processing the unloaded data with an application.
54. The method of claim 53, further comprising: wherein generating the data comprises constructing a message that includes a header and the data with the pipeline accelerator; wherein driving the data comprises driving the message onto the bus with the pipeline accelerator; wherein receiving the data comprises receiving the message from the bus with the communication object; and recovering the data from the message with the first data-transfer object.
55. A method, comprising: retrieving configuration firmware with a configuration manager; loading the configuration firmware into a first buffer with a first communication object; retrieving the configuration firmware from the buffer with a second communication object; driving the configuration firmware onto a bus with a communication object; receiving the configuration firmware with a pipeline accelerator; and configuring the pipeline accelerator with the configuration firmware.
56. The method of claim 55, further comprising: generating a configuration instruction with the configuration manager; and configuring the application to perform an operation corresponding to the configuration instruction.
57. The method of claim 55, further comprising: generating a configuration instruction with the configuration manager; loading the configuration instruction into a second buffer with a third communication object; retrieving the configuration instruction from the second buffer with a fourth communication object; and configuring the application to perform an operation corresponding to the configuration instruction.
58. A method, comprising: generating exception data and driving the exception data onto a bus with a pipeline accelerator; receiving the exception data from the bus with a communication object; loading the received exception data into a buffer with a first data-transfer object; unloading the exception data from the buffer with a second data-transfer object; and processing the unloaded exception data with an exception manager.
59. The method of claim 58, further comprising: retrieving configuration firmware with a configuration manager in response to the exception data; loading the configuration firmware into a second buffer with a third data-transfer object; unloading the configuration firmware from the second buffer with a fourth data-transfer object; driving the configuration firmware onto the bus with the communication object; and reconfiguring the pipeline accelerator with the configuration firmware.
60. The method of claim 58, further comprising: generating a configuration instruction with a configuration manager in response to the exception data; and reconfiguring the application in response to the configuration instruction.
61. A method, comprising: retrieving configuration firmware pointed to by configuration data stored in a configuration registry during an initialization of a computing machine; and configuring a pipeline accelerator of the computing machine with the configuration firmware.
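
By way of non-limiting illustration only, the data flow recited in method claims 37 and 39 (publish, load into a buffer, record a queue value, notify, retrieve) may be sketched in Python as follows. Every identifier in the sketch (Buffer, DataTransferObject, reader, run_demo) is hypothetical and does not appear in the specification; the sketch assumes a single-threaded host, a simple first-in first-out buffer, and a notification modeled as a direct method call, and it is not intended to describe the actual implementation.

```python
from collections import deque


class Buffer:
    """Hypothetical FIFO storage standing in for the claimed first buffer."""

    def __init__(self):
        self._slots = deque()

    def load(self, item):
        self._slots.append(item)

    def unload(self):
        return self._slots.popleft()


class DataTransferObject:
    """Hypothetical data-transfer object; two instances share one class."""

    def __init__(self, buffer):
        self.buffer = buffer

    def load(self, data, queue):
        # Load the published data, then store a queue value that
        # reflects the presence of that data in the buffer.
        self.buffer.load(data)
        queue.append(1)

    def retrieve(self):
        # Called only in response to a notification from the reader.
        return self.buffer.unload()


def reader(queue, second_dto):
    """Read a queue value and, if one exists, notify the second object.

    The notification is modeled (as an assumption) by calling retrieve().
    """
    if not queue:
        return None  # no queue value, so nothing occupies the buffer
    queue.popleft()
    return second_dto.retrieve()


def run_demo():
    buffer = Buffer()
    queue = deque()
    first_dto = DataTransferObject(buffer)   # first instance
    second_dto = DataTransferObject(buffer)  # second instance
    published = b"sample payload"            # data published by the application
    first_dto.load(published, queue)
    return reader(queue, second_dto)
```

In this sketch the application never touches the buffer directly: it hands the published data to the first object and receives it back only through the second object, which mirrors the separation between the application and the data-transfer objects that the claims recite.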